Abstract

We introduce RIANN (Ring Intersection Approximate
Nearest Neighbor search), an algorithm for matching
patches of a video to a set of reference patches in real-time.
For each query, RIANN finds potential matches by intersecting
rings around key points in appearance space. Its search
cost grows with the amount of temporal change, making it
a good fit for videos, where most patches typically change
slowly with time. Experiments show that RIANN is up to two
orders of magnitude faster than previous ANN methods, and is
the only solution that operates in real-time. We further
demonstrate how RIANN can be used for real-time video
processing and provide examples for a range of real-time video
applications, including colorization, denoising, and several
artistic effects.


1. Introduction 

The Approximate Nearest Neighbor (ANN) problem
can be defined as follows: given a set of reference points
and incoming query points, quickly report the reference
point closest to each query. Approximate solutions perform
the task fast, but do not guarantee that the exact nearest
neighbor will be found. Common solutions are based on data
structures that enable fast search, such as random
projections [20, 19] or kd-trees [3, 27].

When the set of queries consists of all patches of an image,
the result is an ANN-Field (ANNF). Barnes et al. [4]
developed the PatchMatch approach for ANNF, which shows
that spatial coherency can be harnessed to obtain fast
computation. Algorithms that integrate traditional ANN
methods with PatchMatch's spatial coherency yield even faster
runtimes [16]. It is thus not surprising that patch-matching
methods have become very popular and now lie at
the heart of many computer vision applications, e.g., texture
synthesis [12], image denoising, super-resolution,
and image editing, to name a few.

In this paper we present an efficient ANNF algorithm
for video. The problem setup we address is matching all
patches of a video to a set of reference patches, in real-time,
as illustrated in Figure 1. In our formulation the reference
set is fixed for the entire video. In fact, we could use the
same reference set for different videos. Additionally, our
reference set of patches is not restricted to originate from a
single image or a video frame, as is commonly assumed for
image PatchMatching. Instead, it consists of a non-ordered
collection of patches, e.g., a dictionary. This setup
enables real-time computation of the ANN Fields for video.
We show empirically that it does not harm accuracy.

Figure 1. Video ANN Fields: (a) Input: live stream of video
frames. (b) A reference set of image patches. (c) For each frame,
a dense ANN Field is produced by matching patches from the
reference set. This is done in realtime.

Our algorithm is designed to achieve low run-time by
relying on two key ideas. First, we leverage the fact that the
reference set is fixed, and hence moderate pre-processing is
acceptable. At pre-processing we construct a data structure
of the reference set that enables efficient indexing during
run-time. Second, we rely on temporal coherency to adapt
our hashing functions to the data. The hashing we use is
not fixed a priori as in previous works, but
rather is tuned per query patch during runtime. In regions
with high temporal change the hashing is tuned to be coarse,
which leads to higher computation times (larger bins
translate to checking more candidate matches). Conversely,
in regions with low change, the hashing is finer and
hence the computation time is lower. We refer to this
approach as “query-sensitive hashing”. Our hashing is generic
to any distance metric. We show examples for working
with 2D spatial patches; however, our algorithm is easily
extended to work with 3D spatio-temporal patches as well.

To confirm the usefulness of the proposed approach we
further discuss how it can be adopted for real-time video
processing. We show that a broad range of patch-based image
transformations can be approximated using our nearest
neighbor matching. Specifically, we provide examples of
denoising, colorization and several styling effects.

The rest of this paper is organized as follows. We start by
reviewing relevant previous work on image and video ANN
in Section 2. We then present our hashing technique and its
integration into a video-ANN Fields solution in Section 3.
We compare the performance of our algorithm to previous
work in Section 4, and suggest several applications in
Section 5. Further discussion is provided in Section 6 and our
conclusions are laid out in Section 7.

2. Related Work 

The general problem of Approximate Nearest Neighbor
matching has received several excellent solutions that have
become highly popular, e.g., [20]. None of these,
however, reach real-time computation of ANN Fields in
video. Image-specific methods for computing the ANN
Field between a pair of images achieve shorter run-times
by further exploiting properties of natural images.
In particular, they rely on spatial coherency
in images to propagate good matches between neighboring
patches in the image plane. While sufficiently fast for
most interactive image-editing applications, these methods
are far from running at conventional video frame rates. It
is only fair to say that these methods were not designed for
video and do not leverage statistical properties of video.

An extension from images to video was proposed by
Liu & Freeman [24] for the purpose of video denoising
through non-local-means. For each patch in the video they
search for k Approximate Nearest Neighbors within the
same frame or in nearby frames. This is done by propagating
candidate matches both temporally, using optical flow,
and spatially in a similar manner to PatchMatch [4]. One
can think of this problem setup as similar to ours, but with
a varying reference set. While we keep a fixed reference
set for the entire video, [24] use a different set of reference
patches for each video frame.

Another group of works that find matches between
patches in video are those that estimate the optical flow
field, e.g., [30]. Several of these achieve real-time
performance, often via GPU implementation. However, the
problem definition for optical flow is different from the one
we pose. The optical flow field aims to capture the motion
in a video sequence. As such, matches are computed only
between consecutive pairs of frames. In addition, small
displacements and smooth motion are usually assumed.
Solutions for large-displacement optical flow have also been
proposed. Several of them, e.g., [8], integrate
keypoint detection and matching into the optical flow
estimation to address large displacements. Chen et al.
initiate the flow estimation with ANN fields obtained by
CSH. None of these methods approaches real-time
performance, with the fastest algorithms running at a few seconds
per frame. Furthermore, while allowing for large
displacements, their goal is still computing a smooth motion field,
while ours is obtaining similarity-based matches.

In Section 5 we show how our ANNF framework can
be used for video processing. The idea of using ANNF
for video processing has been proposed before, and several
works make use of it. Sun & Liu suggest an approach
to video deblocking that considers both optical flow
estimation and ANNs found using a kd-tree. An approach that
utilizes temporal propagation for video super-resolution is
proposed in [25]. It also relies on optical flow estimation
for this. The quality of the results obtained by these
methods is high, but this comes at the price of very long
runtimes, often in the hours.

Our work is also somewhat related to methods for
high-dimensional filtering, which can be run on patches in video
for computing non-local-means. Brox et al. [7] speed up
traditional non-local-means by clustering the patches in a
tree structure. Adams et al. and Gastal et al. propose
efficient solutions for high-dimensional filtering based on
Gaussian kd-trees. None of these methods provide real-time
performance when the filter is non-local-means on patches
(unless harsh dimensionality reduction is applied).

3. Ring Intersection Hashing 

Current solutions to computing the ANN Field between
a pair of images utilize two mechanisms to find candidate
matches for each query patch: spatial coherency, and
appearance-based indexing. Some algorithms rely only on the
first, while others integrate both. To allow for a generic
reference set (rather than an image) we avoid reliance on
spatial coherency altogether. Instead, we rely on an important
characteristic of video sequences: there is a strong temporal
coherence between consecutive video frames. We harness
this for efficient appearance-based indexing.

3.1. Temporal Coherency 

Let $q_{x,y,t}$ denote the query patch $q$ at position $x, y$ in
frame $t$ of the input video. Our first observation is that there
is strong temporal coherency between consecutive patches
$q_{x,y,t-1}$, $q_{x,y,t}$. In other words, most patches change slowly
with time, and hence patch $q_{x,y,t-1}$ is often similar to patch
$q_{x,y,t}$. Our second observation is that temporal coherency
implies coherency in appearance space: if patch $q_{x,y,t-1}$
was matched to patch $r_i$ of the reference set, then patch
$q_{x,y,t}$ is highly likely to be matched to patches similar to
$r_i$ in appearance. This is illustrated visually in Figure 2(a).

Figure 2. Coherency in appearance: (a) When consecutive query
patches $q_{x,y,t-1}$ and $q_{x,y,t}$ are similar, their exact NNs in the
reference set, $r_i$ and $r_j$, are also similar. (b) The histogram depicts the
distribution of distances between such pairs of reference patches
$r_i$ and $r_j$. It shows that for most queries the exact NN lies
in a $d = 1$ radius ball around the NN of the predecessor patch
$q_{x,y,t-1}$.


To evaluate the strength of coherency in appearance we
performed the following experiment. For each 8x8 patch
in a given target video we find its exact nearest neighbor in
a reference set of patches. For this experiment the reference
set consists of all patches of a random frame of the video.
Let patches $r_i$, $r_j$ be the exact NNs of patches $q_{x,y,t-1}$ and
$q_{x,y,t}$, respectively. We then compute the distance in
appearance between the matches: $d = \|r_i - r_j\|_2$. This was
repeated over pairs of consecutive patches from 20 randomly
chosen videos from the Hollywood2 [26] dataset. The
distribution of distances $d$ is presented in Figure 2(b), after
excluding patches where $r_i = r_j$ (effectively excluding static
background patches). As can be seen, for approximately 85% of the
patches $d < 1$. This implies that coherency in appearance
is a strong cue that can be utilized for video ANN.

3.2. Adaptive Hashing 

These observations imply that candidate matches for
patch $q_{x,y,t}$ should be searched near the match $r_i$ of
$q_{x,y,t-1}$. Furthermore, the search region should be adapted
per query. In areas of low temporal change, a local search
near $r_i$ should suffice, whereas in areas of high temporal
change the search should consider a broader range.

So, how do we determine the search area for each query?
The answer requires one further observation. If the
reference set is adequate then $q_{x,y,t}$ and its NN $r_j$ are very close
to each other in appearance space. This suggests that the
distance between $r_i$ and $r_j$ is close to the distance between
$r_i$ and $q_{x,y,t}$, i.e., $\mathrm{dist}(r_i, r_j) \approx \mathrm{dist}(r_i, q_{x,y,t})$. This is
illustrated and verified empirically in Figure 3 for three sizes
of reference sets. As can be seen, for a set size of 900 patches
or more, this assertion holds with very high probability.
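
One way to see why this approximation is reasonable is the following short derivation. It is a sketch that assumes the distance satisfies the triangle inequality (true for the L2 distance used in the experiments); the argument itself is not spelled out in the source.

```latex
\begin{align}
\mathrm{dist}(r_i, r_j) &\le \mathrm{dist}(r_i, q_{x,y,t}) + \mathrm{dist}(q_{x,y,t}, r_j), \\
\mathrm{dist}(r_i, q_{x,y,t}) &\le \mathrm{dist}(r_i, r_j) + \mathrm{dist}(r_j, q_{x,y,t}), \\
\Rightarrow \quad \bigl|\, \mathrm{dist}(r_i, r_j) - \mathrm{dist}(r_i, q_{x,y,t}) \,\bigr| &\le \mathrm{dist}(q_{x,y,t}, r_j).
\end{align}
```

Under this assumption, the gap between the two distances is bounded by the distance from the query to its own nearest neighbor, which is small whenever the reference set represents the query well.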

Based on the above, to find the match $r_j$ we should
search through a “fat” ring of radius $\mathrm{dist}(r_i, q_{x,y,t})$
around $r_i$, as illustrated in Figure 4(a). In areas with
significant change $q_{x,y,t}$ will be far from $r_i$, the ring radius
will be large and will include many reference patches.
Conversely, in areas with little change $q_{x,y,t}$ will be near
$r_i$, the ring radius will be small and will include only a
few candidate reference patches. The width of the ring,
denoted by $2\varepsilon$, could also be tuned to adapt the search
space. We take $\varepsilon$ to be proportional to the
radius: $\varepsilon = \alpha \cdot \mathrm{dist}(r_i, q_{x,y,t})$. In our implementation and all
experiments $\alpha = 0.25$. Our rings are thus wider in areas of
large changes and narrower in regions of little change.

As can be seen in Figure 4(a), the ring around $r_i$
includes the neighbors of $q_{x,y,t}$, but it also includes
reference patches that are very far from $q_{x,y,t}$, e.g., on the other
side of the ring. To exclude these patches from the set of
candidates we employ further constraints. Note that our
observations regarding $r_i$ are true also for any other point
in the reference set. That is, if $\mathrm{dist}(r_j, q_{x,y,t})$ is small,
then $\mathrm{dist}(r_k, r_j) \approx \mathrm{dist}(r_k, q_{x,y,t})$ for any patch $r_k$ of the
reference set. Therefore, we further draw rings of radius
$\mathrm{dist}(r_k, q_{x,y,t}) \pm \varepsilon$ around points $r_k$ selected at random
from the current set of candidates. The final candidate NNs
are those that lie in the intersection of all rings, as
illustrated in Figure 4(b). For each query we continue to add
rings until the number of candidate NNs is below a given
threshold $L$. In all our experiments the threshold is fixed to
$L = 20$ candidates. The final match is the one closest to
$q_{x,y,t}$ among the candidate NNs.

3.3. Computational Complexity 

As a last step of our construction, we need to make sure
that the rings and their intersections can be computed
efficiently during runtime. Therefore, at the pre-processing
stage we compute for each reference patch a sorted list of
its distances from all other reference patches. The
computation time of this list and the space it takes are both $O(n^2)$,
where $n$ is the size of the reference set.

Having these sorted lists available significantly speeds
up the computation at runtime. For each query patch $q_{x,y,t}$,
we compute the distance $d = \mathrm{dist}(q_{x,y,t}, r_i)$, where $r_i$
is the last known match. We add all reference points at
distance $d \pm \varepsilon$ from $r_i$ to the set of candidate NNs. Thus,
the computational cost of each ring consists of one
distance calculation and two binary searches in a sorted array,
each $O(\log n)$. We continue to add rings and compute the
intersection between them until the number of candidates is at most $L$.
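
The pre-processing and the per-ring lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes patches are flattened into rows of a NumPy array and compared with the L2 distance, and the function names `build_index` and `ring` are hypothetical. The ring boundaries are treated as inclusive here.

```python
import numpy as np

def build_index(R):
    """Pre-processing: for each reference patch, sort its distances to all others.

    R is an (n, d) array holding one flattened reference patch per row.
    Returns sorted_d, where sorted_d[i] holds the distances from patch i in
    ascending order, and order, where order[i] holds the matching patch ids.
    Both arrays take O(n^2) memory.
    """
    # Dense pairwise L2 distances (fine for a sketch; memory-hungry for large n).
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=2)
    order = np.argsort(D, axis=1)
    sorted_d = np.take_along_axis(D, order, axis=1)
    return sorted_d, order

def ring(sorted_d, order, i, d, eps):
    """Ids of reference patches r_j with d - eps <= dist(r_i, r_j) <= d + eps."""
    lo = np.searchsorted(sorted_d[i], d - eps, side="left")   # first binary search
    hi = np.searchsorted(sorted_d[i], d + eps, side="right")  # second binary search
    return order[i, lo:hi]

# Usage with synthetic data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
R = rng.normal(size=(200, 64))
R /= np.linalg.norm(R, axis=1, keepdims=True)
sorted_d, order = build_index(R)
q = R[5] + 0.05 * rng.normal(size=64)
i = 17                                   # hypothetical last known match
d = np.linalg.norm(q - R[i])
candidates = ring(sorted_d, order, i, d, 0.25 * d)
```

Each call to `ring` costs one slice plus two binary searches, which matches the per-ring cost stated above; the distance `d` is computed once per ring by the caller.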

Figure 5 explores empirically the relation between the
number of rings and the amount of temporal change. It
shows that the larger the change, the more rings are needed,
and the higher the computation time is. As was shown in
Figure 2(b), for most patches the change is very small, and
hence the overall computation time is small.

3.4. RIANN - Our Algorithm 

We name this approach RIANN (Ring Intersection
ANN). RIANN is outlined in Algorithm 1. Viewing this
process as a hashing scheme, we say that RIANN is a
query-sensitive hashing since it builds bins around the queries, as
opposed to non-sensitive hashing that creates the bins
beforehand. Query-sensitive hashing avoids issues where a
query lies at a boundary of a bin, thus reducing the chance
of better candidates lying in neighbor bins. Temporal
coherency of adjacent frames leads to most queries lying in
the vicinity of the current best match. Here, few
intersections are often enough to provide very few candidates.

Figure 3. Predicting search radius: (Left) The distance $\mathrm{dist}(r_i, q_{x,y,t})$ is a good predictor for the distance $\mathrm{dist}(r_i, r_j)$. (Right) Histograms
of $|\mathrm{dist}(r_i, r_j) - \mathrm{dist}(r_i, q_{x,y,t})|$ for three sizes of reference sets. This suggests that $\mathrm{dist}(r_i, q_{x,y,t})$ is a strong predictor for $\mathrm{dist}(r_i, r_j)$.
The correlation becomes stronger as we use a larger reference set. To exclude static background from our analysis we include only queries
where $r_i \neq r_j$. Statistics were computed over 20 videos from the Hollywood2 database.

Figure 4. Ring Intersection Hashing: (a) To find candidate
neighbors for $q_{x,y,t}$ we draw a ring of radius $d = \mathrm{dist}(r_i, q_{x,y,t})$ around
$r_i$. Here $r_i$ is the match found for $q_{x,y,t-1}$. (b) To exclude
candidates that are far from $q_{x,y,t}$, we draw another ring, this time
around $r_k$, one of the current candidates. We continue to add rings,
and leave in the candidate set only those in the intersection.

RIANN can find ANNs for each query patch, given the 
ANN of its predecessor patch and a reference set. Hence, 
to complete our solution for computing a dense ANN field 
for video, we need to define two more components: (i) 
how the reference set is constructed, and (ii) what we do to 
initialize the first frame. 

Building a reference model: We start by collecting a large
set of patches. To build a global reference set that can be
used for many target videos, the patches are extracted
randomly from a number of natural images. Since natural
images exhibit high redundancy, the collected set of patches is
likely to include many similarities. Inspired by dictionary
construction methods, we seek a more compact
set that represents these patches well. To dilute patches
with high resemblance we cluster the patches using a
high-dimensional regression tree. We then take the median
of each cluster as the cluster representative and normalize
it to be of length 1. Our final reference set consists of the
normalized representatives (query patches are normalized
at runtime). Last, we calculate the pairwise distance matrix
and sort it such that column $i$ contains an ascending order
of distances from patch $i$.

Figure 5. Search complexity determined by temporal change:
(Top) Patch-level analysis: The mean number of ring intersections
as a function of the temporal change $\|q_{t-1} - q_t\|$. (Bottom)
Frame-level analysis: The red curve indicates the total number of ring
intersections performed over all queries in each frame. The blue
curve represents the overall temporal difference between
consecutive frames. This shows a clear correlation between the number of
rings and the temporal difference between frames.

In some applications one might want to use a reference
set that consists only of patches taken from the query
video itself, for example, for methods inspired by
Non-Local-Means. When this is the case, we randomly select a
single frame from the video and collect all its patches. We
then again cluster these patches using a high-dimensional
regression tree and take the cluster representatives
as our reference set. We refer to this as a local reference set.

Initialization: RIANN requires an initial ANN field of
matches for the first frame. We found that initializing
these matches randomly leads to convergence after very
few frames. Hence, this was our initialization approach.


Algorithm 1 RIANN: online search
1: Input: video $V$, reference patch set $R$, sorted distance
   matrix $D$, maximum set size $L = 20$, ring width
   parameter $\alpha = 0.25$
2: Output: $\{ANNF^{(t)}\}$ - dense ANN field for each
   frame $t = 0, 1, 2, \ldots$
3: Initialize: $ANNF^{(0)}(x, y) \sim \mathrm{Uniform}[1, |R|]$
4: for $t = 1, 2, \ldots$
5:   for each query patch $q_{x,y,t}$ of frame $t$
6:     $q_{x,y,t} \leftarrow q_{x,y,t} / \|q_{x,y,t}\|$
7:     $r_i = ANNF^{(t-1)}(x, y)$
8:     $d_i = \mathrm{dist}(q_{x,y,t}, r_i)$; $\varepsilon = \alpha d_i$
9:     Initial candidate set:
10:    $S_{x,y,t} = \{ r_j \ \text{s.t.}\ d_i - \varepsilon < \mathrm{dist}(r_i, r_j) < d_i + \varepsilon \}$
11:    while $|S_{x,y,t}| > L$
12:      Choose a random anchor point $r_k \in S_{x,y,t}$
13:      $d_k = \mathrm{dist}(q_{x,y,t}, r_k)$; $\varepsilon = \alpha d_k$
14:      Update candidate set:
15:      $S_{x,y,t} = S_{x,y,t} \cap \{ r_j \ \text{s.t.}\ d_k - \varepsilon < \mathrm{dist}(r_k, r_j) < d_k + \varepsilon \}$
16:    end while
17:    Find best match for $q_{x,y,t}$ in $S_{x,y,t}$:
18:    $ANNF^{(t)}(x, y) \leftarrow \arg\min_{s \in S_{x,y,t}} \mathrm{dist}(q_{x,y,t}, s)$
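
The inner loop of Algorithm 1 for a single query can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the authors' code: patches are flattened rows compared with the L2 distance, ring boundaries are inclusive, and the names `build_index` and `riann_query` are hypothetical. Two safeguards are added that Algorithm 1 does not mention: a cap `max_rings` on the number of rings, and keeping the previous match `prev` as a candidate so that the result is never empty and never worse than the previous match. A nonzero query patch is assumed.

```python
import numpy as np

def build_index(R):
    """Sorted pairwise distances and the matching patch ids (pre-processing)."""
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=2)
    order = np.argsort(D, axis=1)
    return np.take_along_axis(D, order, axis=1), order

def riann_query(q, prev, R, sorted_d, order, L=20, alpha=0.25,
                max_rings=50, rng=None):
    """Return the index in R of an approximate nearest neighbor of q.

    prev is the index matched at the same location in the previous frame.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = q / np.linalg.norm(q)            # queries are normalized at runtime

    def ring(i):
        # Reference patches whose distance to r_i is within d +/- alpha * d.
        d = np.linalg.norm(q - R[i])
        eps = alpha * d
        lo = np.searchsorted(sorted_d[i], d - eps, side="left")
        hi = np.searchsorted(sorted_d[i], d + eps, side="right")
        return order[i, lo:hi]

    S = ring(prev)                       # initial candidate set
    rings = 1
    while len(S) > L and rings < max_rings:
        k = rng.choice(S)                # random anchor among current candidates
        S = np.intersect1d(S, ring(k))   # keep only the ring intersection
        rings += 1

    # Assumed safeguard: always compare against the previous match as well.
    cands = np.append(S, prev)
    dists = np.linalg.norm(R[cands] - q, axis=1)
    return int(cands[np.argmin(dists)])
```

A dense field for frame t would be obtained by calling `riann_query` once per patch location, passing the match stored for that location at frame t-1 as `prev`.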


4. Empirical evaluation 

Experiments were performed on a set of 20 videos from
the Hollywood2 dataset [26]. Each video consists of 200
frames and is tested under three common video resolutions:
VGA (480x640), SVGA (600x800), and XGA (768x1024).
For each video we compute the ANN field for all frames.
We then reconstruct the video by replacing each patch with
its match and averaging overlapping patches. To assess the
reconstruction quality we compute the average normalized
reconstruction error $E$ between the original frame and the
frame reconstructed by each method. Our time analysis
relates to online performance,
excluding pre-processing time for all methods. Patches are
of size 8x8.
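
The reconstruction step (replace each patch with its match, then average overlapping patches) can be sketched as follows. This is a minimal illustration with hypothetical names; it assumes one match per pixel location (dense, stride 1) and square reference patches stored in their original intensity scale.

```python
import numpy as np

def reconstruct(annf, ref_patches):
    """Rebuild a frame from a dense match field by averaging overlapping patches.

    annf:        (H - p + 1, W - p + 1) integer array; annf[y, x] is the index
                 of the reference patch matched to the patch whose top-left
                 corner is at (y, x).
    ref_patches: (n, p, p) array of reference patches.
    """
    p = ref_patches.shape[1]
    H, W = annf.shape[0] + p - 1, annf.shape[1] + p - 1
    acc = np.zeros((H, W))   # sum of all patch values covering each pixel
    cnt = np.zeros((H, W))   # number of patches covering each pixel
    for y in range(annf.shape[0]):
        for x in range(annf.shape[1]):
            acc[y:y + p, x:x + p] += ref_patches[annf[y, x]]
            cnt[y:y + p, x:x + p] += 1
    return acc / cnt         # every pixel is covered by at least one patch
```

If every patch is matched to an exact copy of itself, the averaged result reproduces the original frame, which is a convenient sanity check for the indexing.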

To assess RIANN we compare its performance against
previous work. We do not compare to optical-flow
methods since, as discussed in Section 2, they solve a different
problem where matches are computed between pairs of
consecutive frames. One could possibly think of ways of
integrating optical flow along the video in order to solve the
matching to a single fixed reference set. However, this could
require the development of a whole new algorithm.

We do compare to three existing methods for image
ANNF: PatchMatch [4], CSH, and TreeCANN.
These methods require a single image as a reference;
therefore, we take for each video a single random frame as
reference. For RIANN, we experiment with both a local model
built from the same selected frame, and with a global
reference model built from patches taken from multiple videos.
The same global reference model was used for all 20
target videos. We do not compare to the video-ANNF of [24]
since they report executing 4 iterations of PatchMatch-like
spatial propagation in addition to optical flow computation.
Their solution is thus slower than PatchMatch, which is the
slowest of the methods we compare to.

Figure 6 displays the results. As can be seen, RIANN is
the only patch-matching method that works in realtime. It
offers the widest tradeoff between speed and accuracy,
making it a good option for a wide range of applications.
RIANN's fastest configurations run at up to hundreds of frames
per second, making it attractive to be used as a preliminary
stage in various realtime systems. In addition, RIANN
offers configurations that outperform previous state-of-the-art
algorithms in both accuracy and speed.

In the experiments presented in Figure 6 the ANNF for
each frame was computed independently of all other frames.
A possible avenue for improving the results of image-based
methods is to initialize the matches for each frame using
the matches of the previous frame. We have tried this with
PatchMatch and found that the speed barely changed, while
accuracy was slightly improved. For CSH, and even more so
for TreeCANN, doing this is not straightforward at all, e.g.,
for TreeCANN one would need a new search policy that
scans the tree from the leaves (the solution of the previous
frame) rather than the root. We leave this for future work.

Changing the size of the reference set affects our
time-accuracy tradeoff. A large reference set helps approximate
the queries with greater precision at the cost of a slower
search. A small reference set leads to a quicker search at
the cost of lower accuracy. It can be seen that when we
use a local reference set, i.e., when the patches are taken
from a single frame, RIANN's accuracy is lower than that
of the other methods. This is probably because we cluster the
patches while they use the raw set. However, this is resolved
when using a global reference set constructed from patches
of several different videos. In that case RIANN's accuracy
matches that of previous methods.

A disadvantage of RIANN is the poor scaling of its memory
footprint ($O(n^2)$), resulting from the storage of the $n \times n$
matrix $D$ of distances between all reference patches ($n$ is
the reference set size). The pre-processing time ranges from
approximately 10 sec for small models up to approximately 2 min
for the maximum size tested, see Figure 6(d). Hence, using a single
global model is advantageous.

Figure 6. Runtime vs. accuracy tradeoffs. (a-c) A comparison of runtime and accuracy shows RIANN is significantly faster than previous
approaches. In fact, RIANN is the only realtime solution. The curves for PatchMatch and CSH represent results for varying numbers of
iterations. For TreeCANN we vary the grid sparsity, and for RIANN the reference set size. The labels next to the tickmarks correspond to
the iterations (PatchMatch, CSH), grid-size (TreeCANN), and set-size (RIANN), respectively. Pre-processing time was excluded for all
methods. For RIANN-global a single pre-processed reference set was used for all the test videos. For CSH, TreeCANN and RIANN-local
a single frame was pre-processed for each video. PatchMatch has no pre-processing. (d) The curves present the pre-processing time (blue)
and memory footprint (red) of RIANN, for reference sets of corresponding sizes.

5. RIANN for Real-Time Video Processing 

The proposed ANN framework could be useful for
several applications. For example, in video conferencing, one
could transmit only the grayscale channel and recolor the
frames, in realtime, at the recipient end. Alternatively, in
man-machine interfaces such as Kinect, one could apply
realtime denoising or sharpening to the video. In both
scenarios, a short pre-processing stage is acceptable, while
realtime performance at runtime is compulsory.

A broad range of effects and transformations can be
approximated by video ANN, e.g., cartooning, oil-painting,
denoising or colorization. This requires applying the
effect/transformation to each patch in the reference set at
pre-processing time. Then, at runtime each query patch is
replaced with the transformed version of its match in the
reference set. Such an approximate, yet realtime, solution could
be highly useful for real-world setups, when the
complexity of existing methods is high. This is currently the case
for several applications. For example, denoising a VGA
frame via BM3D takes ~5 sec, colorization
takes ~15 sec, and some of the styling effects in Adobe®
Photoshop® are far from running at realtime.

For the approximation to be valid, the applied
transformation $T$ should be Lipschitz continuous at patch level:
$\mathrm{dist}(T(q), T(r)) \le \alpha \cdot \mathrm{dist}(q, r)$, where $\alpha$ is the Lipschitz
constant, $q$ is a query patch and $r$ is its ANN. An $\alpha$-Lipschitz
continuous process guarantees that when replacing a query
patch $q$ with its ANN $r$, then $T(r)$ lies in a ball of radius
$\alpha \cdot \mathrm{dist}(q, r)$ around $T(q)$. A smaller $\alpha$ implies that $T$ is
approximated more accurately.
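
The lookup scheme and the Lipschitz bound can be illustrated with a small sketch. Everything here is an assumption made for illustration: the effect `transform` is a made-up linear operation (blending a patch with its own mean) whose Lipschitz constant under the L2 distance is 1, patches are flattened vectors, and a brute-force nearest neighbor search stands in for RIANN.

```python
import numpy as np

ALPHA = 1.0  # Lipschitz constant of the hypothetical transform below (L2 distance)

def transform(patch):
    """Hypothetical effect T: blend a patch with its own mean value."""
    return 0.5 * patch + 0.5 * patch.mean()

def precompute(R):
    """Pre-processing: apply T once to every reference patch."""
    return np.stack([transform(r) for r in R])

def apply_effect(q, R, T_R):
    """Runtime: find the match of q and return its pre-transformed version."""
    j = int(np.argmin(np.linalg.norm(R - q, axis=1)))  # brute-force stand-in for RIANN
    return j, T_R[j]

# Usage with synthetic data: the approximation error is bounded by
# ALPHA * dist(q, r), as the Lipschitz condition states.
rng = np.random.default_rng(0)
R = rng.normal(size=(500, 64))
T_R = precompute(R)
q = R[3] + 0.1 * rng.normal(size=64)
j, approx = apply_effect(q, R, T_R)
error = np.linalg.norm(transform(q) - approx)
bound = ALPHA * np.linalg.norm(q - R[j])
```

The transform is never evaluated on the query at runtime; `transform(q)` appears in the usage lines only to measure the approximation error against the bound.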

Lipschitz continuity of image transformations depends
on the patch size. Working with bigger patches increases
the probability of a transformation $T$ being $\alpha$-Lipschitz
continuous. For larger patches, however, the matches are typically
less accurate and the approximation error is larger. Therefore, to
maintain low errors one needs a larger reference set as the
patch size increases. To see this consider the extreme case
where the patch size is 1 x 1. In this case, using a very small
set of 256 patches (one for each gray level), a perfect
reconstruction of any gray-level video is possible. However, most
transformations are far from being Lipschitz continuous at
this scale, hence they cannot be properly approximated via
patch-matching. Working with bigger patches increases the
probability for Lipschitz continuity, but at the same time
demands a larger reference set.

Figure 7. Colorization: (Top) Example frames from input grayscale videos. (Bottom) The corresponding colored versions, computed
automatically, at 30 FPS, using RIANN. The colorization is quite sensible, with occasional errors in homogeneous regions. (Right) The
charts compare run-time (per frame) and Sum of Squared Differences (per pixel) averaged over our test set. RIANN is both faster and more
accurate than Welsh et al. [33].

Many video scenarios are characterized by high
redundancy, e.g., video chatting, surveillance, indoor scenes, etc.
In these scenarios a small reference set suffices to get
satisfactory approximation quality. Our test set includes generic
videos from the Hollywood2 dataset. Nevertheless, RIANN
provided satisfactory results for these videos, for a variety
of applications (frame resolution was 288 x 352, patch size
8x8). Some example results of edited videos are provided
in the supplementary. We next explain how some of these
effects were realized and compare to prior art.

Realtime Video Colorization: To color a grayscale video
we construct a reference set of grayscale patches for which
we also have the color version available. We use RIANN to
produce the ANN Fields in realtime. Each query grayscale
patch is converted to color by taking the two chromatic
channels of the colored version of its ANN. Since patches
of different colors could correspond to the same grayscale
patch, the usage of a global reference set could result in
inappropriate colors. Therefore, we use a local reference
set constructed from one random frame of the target video.
We generate color versions for its patches by coloring the
frame offline manually (or using a semi-automatic coloring
method). The rest of the video is then colored
automatically by RIANN.

Example results are displayed in Figure 7 and in the
supplementary. It can be seen that our results are plausible most
of the time, with a few glitches, mostly in homogeneous
regions. We further compare our results with those of Welsh
et al. [33], which also provides fully automatic colorization.
Videos of our test set were converted to grayscale and then
re-colored by both RIANN and [33]. We compare the
average L2 distance (SSD) per pixel between the original video
and the colored one as well as the run-time. The results
reported in Figure 7 show that RIANN is 3 orders of
magnitude faster and still more accurate than [33].

Realtime Video Denoising: To test denoising we add
Gaussian noise at a ratio of $\sigma_{noise} / \sigma_{signal} = 7\%$. We then
extract one frame at random, and denoise it using BM3D.
This is used to construct the local reference set. The rest of
the frames are denoised online by replacing each patch with
the denoised version of its ANN. Example results and
quantitative comparisons to BM3D (accurate) and Gaussian
filter (fast) are provided in Figure 8. For both BM3D and
the Gaussian filter we tuned the parameters to get minimal
error. The results show that our goal was achieved: while
maintaining realtime performance, we harm the
accuracy only a little.

Realtime Styling Effects: To show applicability to a wide
range of image transformations, we apply a set of Adobe®
Photoshop® effects to one frame of the video. The patches
of this frame are used to construct the reference set. We then
use RIANN to find ANNs and replace each patch with the
transformed version of its match. We tested several effects
and present in Figure 9 sample results for “Accent Edges”,
“Glowing Edges”, “Photocopy” and “Fresco”.

6. Discussion on Spatial Coherency in Video 

Spatial coherency is typically used for ANN Fields in
images by propagating matches across neighboring patches
in the image plane [4]. At first we thought that spatial
coherency would also be useful for video ANN. However,
our attempts to incorporate spatial coherency suggested
otherwise. Therefore, we performed the following
experiment. Our goal was to compare the accuracy of matches
found when relying on spatial coherency, to the accuracy of
matches based on appearance only. To compute the latter,
we found for each patch in a given target video its exact
NN in a given reference set. To compute the former, we
applied PatchMatch [4] (set for 3 iterations) to the same target
video, since PatchMatch relies heavily on spatial coherency.
As a reference set we took all the patches of a single random
frame of the video.

Figure 8. Denoising: The images are of two example noisy frames and their denoised versions obtained by Gaussian filter, BM3D, and
RIANN. The charts compare run-time (per frame) and Sum of Squared Differences (per pixel) averaged over our test set. RIANN is almost
as fast as Gaussian filtering while being only slightly less accurate than BM3D.

Figure 9. Styling effects: (Top) A reference image and several styling effects applied to it in Photoshop. (Bottom) An example target frame
and the approximated styling effects obtained by RIANN at 30 FPS.

Results are presented in Figure 10. As can be seen,
PatchMatch error jumps significantly when the target and
reference frames are from different shots. This occurs
since there is low spatial coherency across visually
different frames. On the contrary, the error of appearance-based
matching increases only slightly across shot changes. This
supports our approach, which relies solely on temporal
coherency to propagate neighbors in appearance space.

7. Conclusion 

We introduced RIANN, an algorithm for computing
ANN Fields in video in realtime. It can work with any
distance function between patches, and can find several
closest matches rather than just one. These characteristics
could make it relevant to a wider range of applications,
such as tracking or object detection. RIANN is based
on a novel hashing approach: query-sensitive hashing with
query-centered bins. This approach guarantees that queries
are compared against their most relevant candidates. It
could be interesting to see how these ideas translate to other
hashing problems.



Figure 10. Limitations of spatial propagation in video: The curves
correspond to the reconstruction error when replacing each patch
of a video with its match in a reference frame. PatchMatch [4] uses
spatial propagation, while exact NN is based solely on appearance.
An abrupt increase of PatchMatch error with respect to the exact
NN is evident in shots A and C. There, PatchMatch will require more
than 3 iterations to achieve a low reconstruction error as in shot B.
This happens because spatial propagation is less effective as the
dissimilarity between reference and target frames increases.